load("assignment3.rdata")
library(survey)
library(dplyr)
library(ggplot2)
library(stringr)
library(broom)
ANS:
LBXGLU: Fasting Glucose (mg/dL):
Participants aged 12 years and older who were examined in the morning session were tested. Glucose concentration was determined by a hexokinase method. It is an endpoint enzymatic method with a sample blank correction.
Blood specimens were processed, stored and shipped to Fairview Medical Center Laboratory at the University of Minnesota, Minneapolis Minnesota for analysis. Detailed specimen collection and processing instructions are discussed in the NHANES LPM.
_BMXBMI: Body Mass Index (kg/m**2)_
BMI values are calculated for National Health and Nutrition Examination Survey (NHANES) participants using measured height and weight values as follows: weight (kilograms)/height squared (meters)
All survey participants were eligible for the body measurement component. The body measurement data were collected by trained health technicians. The health technician was accompanied by a recorder during each body measurement examination. The respondent’s age, at the time of the screening interview, determined the body measurement examination protocol for survey participants
Framingham Heart Study or the Nurses Health Study? (1 point)
ANS:
The NHANES sample for the survey is selected to represent the U.S. population of all ages. To produce reliable statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.
While other studies, like Framingham Heart Study or Nurses Health Study are representative of a particular subgroup of population.
different types of biomarkers of environmental exposures from the ExposureList array and query the NHANES website for the assay description of the biomarker (5 points).
Environmental exposures:
- LBXBCD - Cadmium (ug/L)
Description: A cadmium assay is performed to identify cases of cadmium toxicity. Occupational exposure is the most common cause of elevated cadmium levels.
Target: Samples: Participants aged 1 year and older who do not meet any of the exclusion criteria are eligible.
Data Process: Whole blood specimens are processed, stored, and shipped to the Division of Laboratory Sciences, National Center for Environmental Health, and Centers for Disease Control and Prevention for analysis.
Environmental Exposures: - URXUCO - Cobalt, urine (ug/L) - URXUCS - Cesium, urine (ug/L) Description: Trace metals have been associated with adverse health effects in occupational studies or laboratory studies, but have not been monitored in general population groups.
This method is used to achieve rapid and accurate quantifications of multiple elements of toxicological and nutritional interest. The method is sensitive and rapid enough to screen urine specimens from subjects suspected to be exposed to a number of important toxic elements or to evaluate environmental or other nonoccupational exposure to these same elements.
Target Samples: Participants aged 6 years and older who met the subsample requirements.
Data Process:
Urine specimens are processed, stored, and shipped to the Division of Environmental Health Laboratory Sciences, National Center for Environmental Health, Centers for Disease Control and Prevention for analysis
Environmental Exposures: LBXCOT - Cotinine (ng/mL)
Description:
The specific aims of the component are:
The tobacco component for NHANES will include questionnaire items on current and past use of cigarettes, pipes, cigars and smokeless tobacco. Exposure to ETS at home and at work and in-utero ETS exposure among children will also be obtained. ETS exposure will also be assessed for examinees 3 years of age and older through the measurement of serum cotinine, a metabolite of nicotine. In addition, use of nicotine replacement products (e.g., gum and patch) will be collected using questionnaires.
Target Samples: Participants aged 3 years and older who do not meet any of the exclusion criteria are eligible.
Data Process:
Serum specimens are processed, stored, and shipped to the Division of Environmental Health Laboratory Sciences, National Center for Environmental Health, Centers for Disease Control and Prevention for analysis
demographic characteristics of your population in the training dataset. (36 points total from a.)-i.); 37 points total)
mean and median. Is the distribution skewed? (1 point)
BMI <- NHData.train %>% filter(!is.na(BMXBMI))
BMI <- BMI %>% mutate(sex = case_when( male == 1 ~ "male",
female == 1 ~ "female"),
race = case_when(white == 1 ~ "white",
black == 1 ~ "black",
mexican == 1 ~ "mexican",
other_hispanic == 1 ~ "hispanic",
other_eth == 1 ~ "others"))
BMI$race <- factor(BMI$race, levels = c("white","black","mexican","hispanic","others"))
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=BMI)
svyhist(~ BMXBMI, dsn)
svymean(~ BMXBMI, dsn)
## mean SE
## BMXBMI 25.839 0.1205
svyquantile( ~ BMXBMI, dsn, quantiles = 0.5)
## 0.5
## BMXBMI 25.19
dim(BMI)
## [1] 17472 195
ANS: 17472, mean: 25.839 Median: 25.19 The distribution is right skewed
svyplot( BMXBMI ~ RIDAGEYR, design = dsn)
svyboxplot( BMXBMI ~ sex, design = dsn)
svyboxplot(BMXBMI ~ white + black + mexican + other_hispanic + other_eth, dsn,
names=c("White","Black","Mexican", "Other Hispanic", "Others"),
xlab = "Race/Ethnicity",
ylab = "BMI")
(INDFMPIR). The income-to-poverty ratio is a the household income divided by the household poverty level for a given survey year. Therefore a Income-to-poverty ratio of 1 means the individual has household income equal to the poverty level.
svyplot(BMXBMI ~ INDFMPIR, dsn , xlab = "Income-to-poverty ratio", ylab = "BMI")
ANS: 1. BMI seems to slightly increase with age in the beginning, and achieve it’s peak between 40 to 50, then the decrease with age.
2. BMI seems no significantly different between sex, and race.
3. BMI slightly increased with income-to-poverty ratio, but may not be significant.
in (f)? (2 points)
ANS: Linear regression. The coefficient of the variables will tell the effects, and the p-value will tell the significance.
fit <- svyglm(BMXBMI ~ RIDAGEYR, data = BMI, design = dsn)
fit %>% tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 20.6623230 0.144659045 142.83464 1.233947e-41
## 2 RIDAGEYR 0.1455033 0.003440554 42.29064 6.435068e-27
ANS: A 1 year change in age increase the BMI by 0.145, p-value is \(6.4 * 10^{-27}\), which is significant.
fit <- svyglm(BMXBMI ~ sex, data = BMI, design = dsn)
fit %>% tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 26.0653755 0.1578484 165.12912 2.136003e-43
## 2 sexmale -0.4646725 0.1395459 -3.32989 2.445986e-03
ANS: Average BMI in female is 26.06 and in male is 25.60. p-value is \(2.45 * 10^{-3}\), which is significant
fit <- svyglm(BMXBMI ~ race, data = BMI, design = dsn)
fit %>% tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 25.87563473 0.1531379 168.96953558 9.399238e-40
## 2 raceblack 0.63332347 0.2267175 2.79344726 9.860121e-03
## 3 racemexican -0.67685827 0.2064778 -3.27811575 3.066543e-03
## 4 racehispanic -0.02531159 0.3276991 -0.07724035 9.390472e-01
## 5 raceothers -1.08260966 0.5238973 -2.06645392 4.929230e-02
ANS: Average BMI and significance:
White: 25.876 (reference) Black: 26.509, p-value = 0.0098 , BMI is significantly different between black and white
Mexian:25.199, p-value = 0.0031 , BMI is significantly different between mexican adnd white
Other Hispanic: 25.85, p-value = 0.939, BMI is not significantly different between other hispanic and white Others: 24.793, p-value = 0.0493, BMI is borderline significantly different between others and white
GLU <- NHData.train %>% filter(!is.na(LBXGLU))
GLU <- GLU %>% mutate(sex = case_when( male == 1 ~ "male",
female == 1 ~ "female"),
race = case_when(white == 1 ~ "white",
black == 1 ~ "black",
mexican == 1 ~ "mexican",
other_hispanic == 1 ~ "hispanic",
other_eth == 1 ~ "others"))
GLU$race <- factor(GLU$race, levels = c("white","black","mexican","hispanic","others"))
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=GLU)
i.a)
svyhist(~ LBXGLU, dsn)
svymean(~ LBXGLU, dsn)
## mean SE
## LBXGLU 100.23 0.5817
svyquantile( ~ LBXGLU, dsn, quantiles = 0.5)
## 0.5
## LBXGLU 94.6
dim(GLU)
## [1] 6476 195
ANS: There are 6476 individuals without NA, mean: 100,23 Median: 94.6 The distribution is more right skewed than BMI.
i.b)
svyplot( LBXGLU ~ RIDAGEYR, design = dsn)
i.c)
svyboxplot(LBXGLU ~ sex, design = dsn)
i.d)
svyboxplot(LBXGLU ~ white + black + mexican + other_hispanic + other_eth, dsn,
names=c("White","Black","Mexican", "Other Hispanic", "Others"),
xlab = "Race/Ethnicity",
ylab = "Glucose")
i.e)
svyplot(LBXGLU ~ INDFMPIR, dsn , xlab = "Income-to-poverty ratio", ylab = "Glucose")
i.f) ANS: 1. Glucose do not have apparent linear relationship with age, but as age increases, the variation of glucose levels also increase 2. Through the boxplot it is hard to tell the different between sex, and race of glucose level.
3. No significant linear relationship is observed between glucose and income-to-poverty ratio
i.g) ANS:
The same as bmi, we will use linear regression. The coefficient of the variables will tell the effects, and the p-value will tell the significance.
i.h.i) ANS:
fit <- svyglm(LBXGLU ~ RIDAGEYR , design = dsn)
fit %>% tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 84.0609374 1.09386378 76.84772 4.066568e-34
## 2 RIDAGEYR 0.3898802 0.02727663 14.29356 2.166597e-14
ANS: A 1 year change in age increase the glucose by 0.390, p-value is \(2.167 * 10^{-14}\), which is significant.
i.h.ii)
fit <- svyglm(LBXGLU ~ sex , design = dsn)
fit %>% tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 97.865578 0.7188393 136.143887 4.719289e-41
## 2 sexmale 4.832581 0.7977256 6.057948 1.566102e-06
ANS: Average glucose in female is 97.866 and in male is 102,699. p-value is \(1.56 * 10^{-6}\), which is significant
i.h.iii.)
fit <- svyglm(LBXGLU ~ race , design = dsn)
fit %>% tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 100.0431624 0.6599272 151.59727005 1.412246e-38
## 2 raceblack -1.4418470 1.2949692 -1.11342190 2.761185e-01
## 3 racemexican 0.6420659 1.0560965 0.60796135 5.486969e-01
## 4 racehispanic 4.3031206 3.0142098 1.42761151 1.657767e-01
## 5 raceothers 0.2109915 3.3379627 0.06320968 9.501022e-01
ANS:
Average Glucose and significance:
White: 100.04 (reference) Black: 98.60, p-value = 0.276 , BMI is not significantly different between black and white
Mexian:100.68, p-value = 0.548 , BMI is not significantly different between mexican adnd white
Other Hispanic: 104.34, p-value = 0.166, BMI is not significantly different between other hispanic and white Others: 100.254, p-value = 0.0493, BMI is not significantly different between others and white
The glucose is not statiscally different between different races
Index? (e.g., what demographic characteristics are associated with both BMI and glucose?) (1 point) ANS:
1. Both BMI and glucose increase with age and are both significant. 2. BMI in female is larger than in male, while the glucose is the opposite - glucose in male is higher than in female. Both of them are significantly different. 3. Black and Mexican have significantly higher BMI than White, but the glucose have no significant difference between all groups.
Specifically, we will explore two biomarkers of exposure and nutrition, including a heavy metal exposure that recently made the headlines in Flint, Michigan: serum Lead (LBXBPB). (13 points total)
BPB <- NHData.train %>% filter(!is.na(LBXBPB))
BPB <- BPB %>% mutate(sex = case_when( male == 1 ~ "male",
female == 1 ~ "female"),
race = case_when(white == 1 ~ "white",
black == 1 ~ "black",
mexican == 1 ~ "mexican",
other_hispanic == 1 ~ "hispanic",
other_eth == 1 ~ "others"))
BPB$race <- factor(BPB$race, levels = c("white","black","mexican","hispanic","others"))
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=BPB)
distribution. (1 point)
svyhist(~ LBXBPB, dsn)
svymean(~ LBXBPB, dsn)
## mean SE
## LBXBPB 1.9738 0.0293
svyquantile( ~ LBXBPB, dsn, quantiles = 0.5)
## 0.5
## LBXBPB 1.5
dim(BPB)
## [1] 16915 195
ANS: The distribution is highly right skewed, mean serum lead is 1.97, and the median is 1.5
what transformation would you use here (hint: think exponential decay)? Why would you want to make a dependent variable look more “normal” (hint: think interpretation) (2 points)
ANS: Take log transformation can make the data distribution less skewed. Making the data looks more normal with log transform let us able to fit our data into some general linear model, like t-test or regression.
svyplot(log(LBXBPB) ~ RIDAGEYR, dsn, xlab = "Age", ylab = "Lead")
svyboxplot(log(LBXBPB) ~ sex, dsn, xlab = "Gender", ylab = "Lead")
#### e.) Plot a boxplot of LBXBPB versus race/ethnicity (1 point)
svyboxplot(log(LBXBPB) ~ race, dsn, xlab = "Race", ylab = "Lead")
#### f.) Plot a plot of LBXBPB versus an indicator of socioecononomic status, the income to poverty ratio (INDFMPIR) (1 point)
svyplot(log(LBXBPB) ~ INDFMPIR, dsn, xlab = "Income-to-Poverty Ratio", ylab = "Lead")
points).
fit <- svyglm(log(LBXBPB) ~ RIDAGEYR, data = BPB, design = dsn)
fit %>% tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 0.096896357 0.0295123528 3.283247 2.755865e-03
## 2 RIDAGEYR 0.009295286 0.0005868151 15.840229 1.657708e-15
ANS: A 1 year change in age will increase log(LBXBPB) by 0.0093, which is about 1.01 in lead levels. P-value is \(1.65*10^{-15}\), which is significant.
fit <- svyglm(log(LBXBPB) ~ sex, data = BPB, design = dsn)
fit %>% tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 0.2432349 0.01464807 16.60526 5.004044e-16
## 2 sexmale 0.3940273 0.01232012 31.98242 1.375464e-23
ANS: Average Lead in female is \(e^{0.243} = 1.275\), in male is \(e^{0.637} = 1.891\), p-value is \(1.37 * 10^{-23}\), which is significantly different
Whites, and Other Hispanic vs. White, and Other versus White (2 points).
fit <- svyglm(log(LBXBPB) ~ race, data = BPB, design = dsn)
fit %>% tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 0.41779385 0.01418008 29.4634342 6.199649e-21
## 2 raceblack 0.14836175 0.02021919 7.3376714 1.094140e-07
## 3 racemexican 0.06544387 0.03388176 1.9315367 6.482823e-02
## 4 racehispanic -0.01582878 0.04373909 -0.3618909 7.204745e-01
## 5 raceothers -0.06601124 0.02856402 -2.3109922 2.936507e-02
fit %>% tidy() %>% select(estimate) %>% exp()
## estimate
## 1 1.5186076
## 2 1.1599324
## 3 1.0676328
## 4 0.9842958
## 5 0.9361203
ANS: Average Lead levels and significance:
White: 1.518 (reference) Black: 1.159, p-value = \(1.09 * 10^{-7}\) , Lead levels is significantly different between black and white
Mexian:1.067, p-value = 0.065 , Lead levels is not significantly different between mexican adnd white
Other Hispanic: 0.984, p-value = 0.720, Lead levels is not significantly different between other hispanic and white Others: 0.936, p-value = 0.0294, Lead levels is significantly different between others and white
points)
7.a)
VitD <- NHData.train %>% filter(!is.na(LBXVID))
VitD <- VitD %>% mutate(sex = case_when( male == 1 ~ "male",
female == 1 ~ "female"),
race = case_when(white == 1 ~ "white",
black == 1 ~ "black",
mexican == 1 ~ "mexican",
other_hispanic == 1 ~ "hispanic",
other_eth == 1 ~ "others"))
VitD$race <- factor(VitD$race, levels = c("white","black","mexican","hispanic","others"))
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=VitD)
svyhist(~ LBXVID, dsn)
svymean(~ LBXVID, dsn)
## mean SE
## LBXVID 23.76 0.385
svyquantile( ~ LBXVID, dsn, quantiles = 0.5)
## 0.5
## LBXVID 23
dim(VitD)
## [1] 7807 195
ANS: There are 7807 individuals with Vitamin D. Mean:23.76, Media:23, The distribution is normally distributed, with mildly right skewed
7.c)
svyboxplot(LBXVID ~ sex, dsn, xlab = "sex", ylab = "Vitamin D")
7.d)
svyboxplot(LBXVID ~ race, dsn, xlab = "race", ylab = "Vitamin D")
7.e)
svyplot(LBXVID ~ INDFMPIR, dsn , xlab = "Income-to-poverty ratio", ylab = "BMI")
ANS:
Demographic variables can be confounders between Lead and BMI. For example, sex is associate with Lead level and BMI, therefore, sex can be a counfounder. Mediators has to be on the causative pathway between exposure(Lead) and outcome(BMI), demographic variables are less likely to be influence by lead level.
Vitamin D shows assoication with age, sex and race, so demorgraphic variables can also be counders between VitD and BMI. Demographic variables, again, is unlikely to a mediator. However, if we look at the association between lead and BMI, vitamin D could potentially be a mediator, if there are some biological evidence being proven. For example, lead level could influence the absorbtion of vitamin D, and cause the BMI change.
Now we will test each of the exposures in ExposureList for linear association with a quantitative or continuous trait in the training and testing datasets separately. Write R code, from scratch, to execute a EWAS called ewas.R for a given dataset (e.g., training or testing) (code and output: 10 total points). Your script will input either a flag to execute the EWAS on the training or testing datasets and output a .csv file that contains the exposure ID (e.g., LBXBPB),exposure name, phenotype name,estimate,standard error, pvalue,false discovery rate (FDR)
As implied in the output above, your code should execute association that tests the association between each exposure in ExposureList and 1 SD of the phenotype (e.g., BMI) for a 1 standard deviation change in the logarithm base 10 of the exposure value. The dependent variable is the phenotype, and the independent variable is an exposure. Adjust all models by age, race/ethnicity, income/poverty ratio (INDFMPIR). Using your R code, execute the following:
flag <- "test"
if(flag == "train"){
data <- NHData.train
}else{
data <- NHData.test
}
data <- data %>% filter(!is.na(BMXBMI))
data <- data %>% mutate(sex = case_when( male == 1 ~ "male",
female == 1 ~ "female"),
race = case_when(white == 1 ~ "white",
black == 1 ~ "black",
mexican == 1 ~ "mexican",
other_hispanic == 1 ~ "hispanic",
other_eth == 1 ~ "others"))
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data= data)
bmi <- data.frame(matrix(vector(), 0, 6))
for(exposure in ExposureList){
f <- str_glue("scale(BMXBMI) ~ scale(log10({exposure} + 1e-07)) + RIDAGEYR + sex + race + INDFMPIR")
exp_name <- ExposureDescription %>% filter(var == exposure) %>% select(var_desc) %>% distinct(var_desc) %>% slice(1) %>% pull(var_desc)
# Check there is match for two data
if(length(table(data$sex[!is.na(data[,exposure])]))<=1 | length(table(data[,exposure])) <= 1){
d <- data.frame(exposure, exp_name, "BMXBMI", fit.estimate=NA, fit.std.error=NA, fit.p.value=NA)
}else{
fit <- svyglm(f, data = data, dsn) %>% tidy()
target <- str_glue('scale(log10({exposure} + 1e-07))')
if(target %in% fit$term){
fit <- fit %>% filter(term == target)
d <- data.frame(exposure, exp_name, "BMXBMI", fit$estimate, fit$std.error, fit$p.value)
}else{
d <- data.frame(exposure, exp_name, "BMXBMI", fit.estimate=NA, fit.std.error=NA, fit.p.value=NA)
}
}
bmi <- rbind(bmi, d)
}
colnames(bmi) <- c("Exposure ID", "Exposure Name", "Phenotype Name", "Estimate", "Standard Error", "P-value")
bmi$FDR <- p.adjust(bmi$`P-value`, method = "fdr")
output <- str_glue("bmi_{flag}.csv")
write.csv(bmi,output, row.names = FALSE)
the training and testing datasets separately. To scale fasting glucose, you can use the scale and log command in the linear modeling function. Your output should be named fasting_glucose_train.csv and fasting_glucose_test.csv.
flag <- "test"
if(flag == "train"){
data <- NHData.train
}else{
data <- NHData.test
}
data <- data %>% filter(!is.na(LBXGLU))
data <- data %>% mutate(sex = case_when( male == 1 ~ "male",
female == 1 ~ "female"),
race = case_when(white == 1 ~ "white",
black == 1 ~ "black",
mexican == 1 ~ "mexican",
other_hispanic == 1 ~ "hispanic",
other_eth == 1 ~ "others"))
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data= data)
glu <- data.frame(matrix(vector(), 0, 6))
for(exposure in ExposureList){
f <- str_glue("scale(log(LBXGLU)) ~ scale(log10({exposure} + 1e-07)) + RIDAGEYR + sex + race + INDFMPIR")
exp_name <- ExposureDescription %>% filter(var == exposure) %>% select(var_desc) %>% distinct(var_desc) %>% slice(1) %>% pull(var_desc)
# Check there is match for two data
if(length(table(data$sex[!is.na(data[,exposure])]))<=1 | length(table(data[,exposure])) <= 1){
d <- data.frame(exposure, exp_name, "LBXGLU", fit.estimate=NA, fit.std.error=NA, fit.p.value=NA)
}else{
fit <- svyglm(f, data = data, dsn) %>% tidy()
target <- str_glue('scale(log10({exposure} + 1e-07))')
if(target %in% fit$term){
fit <- fit %>% filter(term == target)
d <- data.frame(exposure, exp_name, "LBXGLU", fit$estimate, fit$std.error, fit$p.value)
}else{
d <- data.frame(exposure, exp_name, "LBXGLU", fit.estimate=NA, fit.std.error=NA, fit.p.value=NA)
}
}
glu <- rbind(glu, d)
}
colnames(glu) <- c("Exposure ID", "Exposure Name", "Phenotype Name", "Estimate", "Standard Error", "P-value")
glu$FDR <- p.adjust(glu$`P-value`, method = "fdr")
output <- str_glue("glu_{flag}.csv")
write.csv(glu,output, row.names = FALSE)
You can execute your ewas.R script on Orchestra OR on your local computer.
bmi <- read.csv("bmi_train.csv")
bmi %>% filter(!is.na(Estimate)) %>% mutate(threshold = ifelse(Estimate >= 0.2 | Estimate <= -0.2, "Out", "In")) %>% ggplot(aes(x=Estimate, y=-log10(FDR))) +
geom_point(aes(col = threshold), size=1.2) +
scale_colour_manual(values = c("Out" = "Red", "In" = "Black"))
glu <- read.csv("glu_train.csv")
glu %>% filter(!is.na(Estimate)) %>% mutate(threshold = ifelse(Estimate >= 0.2 | Estimate <= -0.2, "Out", "In")) %>% ggplot(aes(x=Estimate, y=-log10(FDR))) +
geom_point(aes(col = threshold), size=1.2) +
scale_colour_manual(values = c("Out" = "Red", "In" = "Black"))
ANS: The vocalno plot shows the significance and effect size at the same time. The higher the point is means it’s more significant, and the outer it is mean it has larger effect size. The reason we scale the variables is because now we can compare all the variables wit different absolute value on the same scale.
mean? (2 points)
files <- c("bmi_train.csv","bmi_test.csv","glu_train.csv","glu_test.csv")
for(file in files){
data <- read.csv(file)
order_data <- data[order(data$FDR),c("P.value","FDR")]
print(order_data[max(which(order_data$FDR < 0.05)),])
print(order_data[max(which(order_data$FDR < 0.01)),])
}
## P.value FDR
## 86 0.02112903 0.04724687
## P.value FDR
## 125 0.002813355 0.009846743
## P.value FDR
## 106 0.02162014 0.04902595
## P.value FDR
## 108 0.003080436 0.009184263
## P.value FDR
## 107 0.004023426 0.0489001
## P.value FDR
## 103 0.0001543363 0.008128381
## P.value FDR
## 13 0.002495111 0.04435753
## P.value FDR
## 139 0.0004497418 0.008994836
ANS: BMI Train: FDR 5%: 0.0211 1%: 0.0028 BMI Test: FDR 5%: 0.0216 1%: 0.0031 Glucose Train: FDR 5%: 0.0040 1%: 0.00015 Glucose Test: FDR 5%: 0.0025 1%: 0.00045
FDR 5% threshold means that in multiple testing, we want to control the expected propotion of the expected proportion of “discoveries” (rejected null hypotheses) that are false (incorrect rejections) to be 5%, and 1% . Same for FDR 1%.
bmi_train <- read.csv("bmi_train.csv")
bmi_test <- read.csv("bmi_test.csv")
bmi_train_list <- bmi_train %>% filter(FDR < 0.1)
bmi_test_list <- bmi_test %>% filter(P.value < 0.05)
bmi_list <- bmi_train_list %>%
inner_join(bmi_test_list, by = "Exposure.ID") %>%
filter(Estimate.x * Estimate.x > 0)
dim(bmi_list)
## [1] 63 13
glu_train <- read.csv("glu_train.csv")
glu_test <- read.csv("glu_test.csv")
glu_train_list <- glu_train %>% filter(FDR < 0.1)
glu_test_list <- glu_test %>% filter(P.value < 0.05)
glu_list <- glu_train_list %>%
inner_join(glu_test_list, by = "Exposure.ID") %>%
filter(Estimate.x * Estimate.x > 0)
dim(glu_list)
## [1] 9 13
ANS: There are 63 replicated findings in BMI, and 9 replicated findings in glucose.
bmi_list %>% arrange(FDR.x) %>% head()
## Exposure.ID Exposure.Name.x Phenotype.Name.x Estimate.x
## 1 LBXGTC g-tocopherol(ug/dL) BMXBMI 0.2276420
## 2 LBXB12 Vitamin B12, serum (pg/mL) BMXBMI -0.2042356
## 3 LBXFOL Folate, serum (ng/mL) BMXBMI -0.1892721
## 4 LBXBPB Lead (ug/dL) BMXBMI -0.2016765
## 5 LBXHPE Heptachlor Epoxide (ng/g) BMXBMI 0.2688981
## 6 LBXBEC trans-b-carotene(ug/dL) BMXBMI -0.2572972
## Standard.Error.x P.value.x FDR.x Exposure.Name.y
## 1 0.009970789 2.616338e-16 4.212304e-14 g-tocopherol(ug/dL)
## 2 0.010279947 4.284396e-15 2.542922e-13 Vitamin B12, serum (pg/mL)
## 3 0.009574916 4.738364e-15 2.542922e-13 Folate, serum (ng/mL)
## 4 0.015785288 2.272531e-11 9.146939e-10 Lead (ug/dL)
## 5 0.028521789 5.380970e-09 1.732672e-07 Heptachlor Epoxide (ng/g)
## 6 0.008080847 7.791652e-09 2.090760e-07 trans-b-carotene(ug/dL)
## Phenotype.Name.y Estimate.y Standard.Error.y P.value.y FDR.y
## 1 BMXBMI 0.2354186 0.01046289 1.120786e-16 9.022329e-15
## 2 BMXBMI -0.1881993 0.01312390 1.210140e-12 2.164806e-11
## 3 BMXBMI -0.1821714 0.01031701 1.769596e-14 4.748415e-13
## 4 BMXBMI -0.2021204 0.01558564 8.831201e-12 1.421823e-10
## 5 BMXBMI 0.2908614 0.03256665 4.483683e-05 4.010405e-04
## 6 BMXBMI -0.2587954 0.01285841 1.167811e-15 6.267251e-14
glu_list %>% arrange(FDR.x) %>% head()
## Exposure.ID Exposure.Name.x Phenotype.Name.x
## 1 LBXGTC g-tocopherol(ug/dL) LBXGLU
## 2 LBXBEC trans-b-carotene(ug/dL) LBXGLU
## 3 LBXCBC cis-b-carotene(ug/dL) LBXGLU
## 4 LBXCRY b-cryptoxanthin(ug/dL) LBXGLU
## 5 LBXLUZ Combined Lutein/zeaxanthin (ug/dL) LBXGLU
## 6 LBXVID Vitamin D (ng/mL) LBXGLU
## Estimate.x Standard.Error.x P.value.x FDR.x
## 1 0.15445792 0.01577668 2.804497e-09 4.431105e-07
## 2 -0.10802779 0.01999974 1.006906e-03 2.656847e-02
## 3 -0.09134860 0.01751955 1.233767e-03 2.784788e-02
## 4 -0.08873553 0.01933813 2.517857e-03 3.978215e-02
## 5 -0.08408332 0.01914369 3.186797e-03 4.577400e-02
## 6 -0.14307110 0.03371682 3.824597e-03 4.890010e-02
## Exposure.Name.y Phenotype.Name.y Estimate.y
## 1 g-tocopherol(ug/dL) LBXGLU 0.14217226
## 2 trans-b-carotene(ug/dL) LBXGLU -0.13970371
## 3 cis-b-carotene(ug/dL) LBXGLU -0.14444696
## 4 b-cryptoxanthin(ug/dL) LBXGLU -0.08109641
## 5 Combined Lutein/zeaxanthin (ug/dL) LBXGLU -0.05561641
## 6 Vitamin D (ng/mL) LBXGLU -0.11891051
## Standard.Error.y P.value.y FDR.y
## 1 0.01710819 3.121283e-08 2.497027e-06
## 2 0.01910467 2.537314e-07 1.353234e-05
## 3 0.01661959 1.451483e-08 2.322372e-06
## 4 0.01968215 4.497418e-04 8.994836e-03
## 5 0.01699393 3.479926e-03 5.061711e-02
## 6 0.01736755 7.063406e-07 2.825363e-05
BMI Top 3 findings: 1. LBXGTC g-tocopherol(ug/dL) : BMI increase by 0.2276420 SD, with 1 SD change 2. LBXB12 Vitamin B12, serum (pg/mL): BMI decrease by 0.2042356, with 1 SD change 3. LBXFOL Folate, serum (ng/mL): BMI decrease by 0.1892721, with 1 SD change
Glucose Top3 findings: 1. LBXGTC g-tocopherol(ug/dL): Glucose increase by 0.15445792 SD, with 1 SD change 2. LBXBEC trans-b-carotene(ug/dL): Glucose decrease by 0.10802779 SD, with 1 SD change 3. LBXCBC cis-b-carotene(ug/dL): Gluose decrease by 0.09134860 SD, with 1 SD change
the estimates (from the training data) by plotting the estimates from body mass index on one axis and glucose on the other (each point is an estimate for a biomarker of exposure). What findings have replicated effects in the same direction? Different directions? What might this imply with respect to these phenotypes? (4 points)
similarity <- data.frame(term = bmi_train$Exposure.ID, bmi_estimate = bmi_train$Estimate, glu_estimate = glu_train$Estimate) %>%
mutate(direction = case_when(bmi_estimate * glu_estimate > 0 ~ "Same Direction",bmi_estimate * glu_estimate <0 ~"Different Direction")) %>% filter(!is.na(direction)) %>%
mutate(replicate = case_when(term %in% bmi_list$Exposure.ID | term %in% glu_list$Exposure.ID ~ "Replicate",
TRUE ~ "Not Replicate")) %>%
mutate(label = case_when(replicate == "Replicate" & direction == "Same Direction" ~ "Replicate-Same",
replicate == "Replicate" & direction == "Different Direction" ~ "Replicate-Different",
TRUE ~ "Non-Replicate"))
ggplot(similarity, aes(x = bmi_estimate, y = glu_estimate, col = label)) + geom_point()
similarity %>% filter(label == "Replicate-Same") %>% select(term) %>% print()
## term
## 1 LBXBPB
## 2 LBXCOT
## 3 URXUBA
## 4 URXUSB
## 5 LBXFOL
## 6 LBXVID
## 7 URXMEP
## 8 URXMZP
## 9 URXMHH
## 10 URXMOH
## 11 LBX074
## 12 LBX105
## 13 LBX118
## 14 LBXD04
## 15 LBXD05
## 16 LBXD07
## 17 LBXF04
## 18 LBXF05
## 19 LBXPCB
## 20 LBX099
## 21 LBX194
## 22 LBX196
## 23 LBD199
## 24 LBXBHC
## 25 LBXPDT
## 26 LBXTNA
## 27 LBXHPE
## 28 LBXDIE
## 29 URXP04
## 30 URXP06
## 31 URXP07
## 32 URXP17
## 33 LBXIRN
## 34 LBXALC
## 35 LBXBEC
## 36 LBXCBC
## 37 LBXCRY
## 38 LBXGTC
## 39 LBXLUZ
similarity %>% filter(label == "Replicate-Different") %>% select(term) %>% print()
## term
## 1 URXUCD
## 2 URXUCS
## 3 URXUTL
## 4 LBXB12
## 5 LBX156
## 6 LBXHXC
## 7 LBX146
## 8 LBX153
## 9 LBX170
## 10 LBX178
## 11 LBX180
## 12 LBX187
## 13 LBX206
## 14 LBXMIR
## 15 URXP02
## 16 URXP19
## 17 LBXLYC
## 18 LBXRPL
## 19 LBXRST
## 20 LBXVIA
## 21 URX14D
## 22 LBXSPH
## 23 URXETL
ANS: There are 39 findings that have replicated effects in either bmi or glucose and have same direction. LBXBPB,LBXCOT,URXUBA,URXUSB,LBXFOL,LBXVID,URXMEP,URXMZP,URXMHH,URXMOH,LBX074,LBX105,LBX118,LBXD04,LBXD05,LBXD07,LBXF04,LBXF05,LBXPCB,LBX099,LBX194,LBX196,LBD199,LBXBHC,LBXPDT,LBXTNA,LBXHPE,LBXDIE,URXP04,URXP06,URXP07,URXP17,LBXIRN,LBXALC,LBXBEC,LBXCBC,LBXCRY,LBXGTC,LBXLUZ
There are 19 findings that have replicated effects and have different direction. URXUCD,URXUCS,URXUTL,LBXB12,LBX156,LBXHXC,LBX146,LBX153,LBX170,LBX178,LBX180,LBX187,LBX206,LBXMIR,URXP02,URXP19,LBXLYC,LBXRPL,LBXRST,LBXVIA,URX14D,LBXSPH,URXETL
When we try to evaluate the association between glucose and bmi, these exposure could be potential confounders that influence the phenotypes.
You just successfully executed cross-sectional associations in two continuous phenotypes, fasting glucose and body mass index.
ANS: A study with individual-level variables that measures exposure and disease at one point in time. A snapshot of the study population. This study design provides weak evidence of causal assocation between exposure and outcome because we may not be certain that the exposure preceded the disease.
ANS: Cox Regression Model.
3.) Two columns in the dataset include PERMTH_EXM and MORTSTAT. Specifically, the US CDC ascertains if individuals surveyed in the NHANES have died in years 2006 and beyond. If they have died at the time of followup, the MORTSTAT variable is equal to 1. If they have not died, the MORTSTAT variable is equal to 0. The time at which death is ascertained from the time an individual surveyed is in the variable PERMTH_EXM. For example, if the PERMTH_EXM variable is equal to 10, and the MORTSTAT is equal to 1, that means the individual died 10 months after the survey when follow. If the MORTSTAT is equal to 1, the individual was alive at the time of follow-up. Use this information to estimate whether any of your top findings in fasting glucose and body mass index is associated with time to death, even after accounting for individual age, sex, race/ethnicity, and income to poverty ratio. Specifically, estimate the risk for death as a hazard ratio for the top finding from your BMI and glucose EWAS. (14 points)
library("survival")
library("survminer")
## Loading required package: ggpubr
## Loading required package: magrittr
survive_data <- NHData.train %>% filter(!is.na(PERMTH_EXM) & !is.na(MORTSTAT)) %>%
mutate(sex = case_when( male == 1 ~ "male",
female == 1 ~ "female"),
race = case_when(white == 1 ~ "white",
black == 1 ~ "black",
mexican == 1 ~ "mexican",
other_hispanic == 1 ~ "hispanic",
other_eth == 1 ~ "others"))
survive_data$race <- factor(survive_data$race, levels = c("white","black","mexican","hispanic","others"))
dsn <- svydesign(ids=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T, data=survive_data)
for(top_exposure in c("LBXGTC","LBXB12","LBXFOL","LBXGTC","LBXBEC","LBXCBC")){
f <- str_glue("Surv(PERMTH_EXM, MORTSTAT) ~ scale({top_exposure})+ RIDAGEYR + sex + race + INDFMPIR")
res_cox <- svycoxph(as.formula(f) , data = survive_data, design = dsn)
print(res_cox %>% tidy())
}
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
## nest = T, data = survive_data)
## term estimate std.error statistic p.value conf.low
## 1 scale(LBXGTC) 0.1148822 0.043461069 2.6433350 8.209376e-03 0.02970004
## 2 RIDAGEYR 0.0843721 0.004143123 20.3643708 0.000000e+00 0.07625173
## 3 sexmale 0.6449254 0.108042939 5.9691580 2.384811e-09 0.43316511
## 4 raceblack 0.3086777 0.129467339 2.3842129 1.711570e-02 0.05492638
## 5 racemexican -0.1510516 0.263271828 -0.5737477 5.661385e-01 -0.66705491
## 6 racehispanic -0.1991432 0.188227720 -1.0579906 2.900597e-01 -0.56806270
## 7 raceothers -0.4408497 0.307442570 -1.4339255 1.515935e-01 -1.04342611
## 8 INDFMPIR -0.2358325 0.038074102 -6.1940405 5.864114e-10 -0.31045640
## conf.high
## 1 0.20006429
## 2 0.09249247
## 3 0.85668565
## 4 0.56242902
## 5 0.36495169
## 6 0.16977640
## 7 0.16172662
## 8 -0.16120866
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
## nest = T, data = survive_data)
## term estimate std.error statistic p.value
## 1 scale(LBXB12) 0.02023399 0.022791024 0.8878052 3.746456e-01
## 2 RIDAGEYR 0.08272030 0.004158308 19.8927795 0.000000e+00
## 3 sexmale 0.62441878 0.111920688 5.5791185 2.417405e-08
## 4 raceblack 0.29612307 0.124469912 2.3790735 1.735621e-02
## 5 racemexican -0.14344298 0.218205240 -0.6573764 5.109389e-01
## 6 racehispanic -0.21822865 0.216585810 -1.0075852 3.136537e-01
## 7 raceothers -0.58585543 0.315841491 -1.8549033 6.361003e-02
## 8 INDFMPIR -0.24476425 0.034186629 -7.1596486 8.087975e-13
## conf.low conf.high
## 1 -0.02443560 0.06490357
## 2 0.07457017 0.09087044
## 3 0.40505826 0.84377929
## 4 0.05216653 0.54007962
## 5 -0.57111739 0.28423144
## 6 -0.64272904 0.20627174
## 7 -1.20489338 0.03318251
## 8 -0.31176881 -0.17775969
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
## nest = T, data = survive_data)
## term estimate std.error statistic p.value
## 1 scale(LBXFOL) 0.01874153 0.040884849 0.4583980 6.466665e-01
## 2 RIDAGEYR 0.08240987 0.004129067 19.9584710 0.000000e+00
## 3 sexmale 0.62979477 0.118216397 5.3274739 9.958806e-08
## 4 raceblack 0.30525581 0.123250144 2.4767177 1.325967e-02
## 5 racemexican -0.13545772 0.216902410 -0.6245100 5.322927e-01
## 6 racehispanic -0.21436894 0.218017710 -0.9832639 3.254776e-01
## 7 raceothers -0.60103379 0.313826645 -1.9151777 5.546985e-02
## 8 INDFMPIR -0.24528664 0.033916288 -7.2321193 4.755085e-13
## conf.low conf.high
## 1 -0.06139130 0.09887436
## 2 0.07431705 0.09050270
## 3 0.39809489 0.86149465
## 4 0.06368997 0.54682166
## 5 -0.56057863 0.28966319
## 6 -0.64167580 0.21293792
## 7 -1.21612271 0.01405513
## 8 -0.31176135 -0.17881194
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
## nest = T, data = survive_data)
## term estimate std.error statistic p.value conf.low
## 1 scale(LBXGTC) 0.1148822 0.043461069 2.6433350 8.209376e-03 0.02970004
## 2 RIDAGEYR 0.0843721 0.004143123 20.3643708 0.000000e+00 0.07625173
## 3 sexmale 0.6449254 0.108042939 5.9691580 2.384811e-09 0.43316511
## 4 raceblack 0.3086777 0.129467339 2.3842129 1.711570e-02 0.05492638
## 5 racemexican -0.1510516 0.263271828 -0.5737477 5.661385e-01 -0.66705491
## 6 racehispanic -0.1991432 0.188227720 -1.0579906 2.900597e-01 -0.56806270
## 7 raceothers -0.4408497 0.307442570 -1.4339255 1.515935e-01 -1.04342611
## 8 INDFMPIR -0.2358325 0.038074102 -6.1940405 5.864114e-10 -0.31045640
## conf.high
## 1 0.20006429
## 2 0.09249247
## 3 0.85668565
## 4 0.56242902
## 5 0.36495169
## 6 0.16977640
## 7 0.16172662
## 8 -0.16120866
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
## nest = T, data = survive_data)
## term estimate std.error statistic p.value
## 1 scale(LBXBEC) -0.24102727 0.059714332 -4.0363387 5.429182e-05
## 2 RIDAGEYR 0.08071093 0.004828037 16.7171307 0.000000e+00
## 3 sexmale 0.63434539 0.138236957 4.5888263 4.457451e-06
## 4 raceblack 0.35099104 0.201563601 1.7413414 8.162375e-02
## 5 racemexican -0.39625303 0.308280259 -1.2853662 1.986643e-01
## 6 racehispanic -0.31042201 0.314730936 -0.9863092 3.239814e-01
## 7 raceothers -1.63268340 0.833080340 -1.9598151 5.001741e-02
## 8 INDFMPIR -0.21854661 0.054030193 -4.0448977 5.234597e-05
## conf.low conf.high
## 1 -0.35806521 -0.1239893290
## 2 0.07124815 0.0901737060
## 3 0.36340593 0.9052848462
## 4 -0.04406636 0.7460484435
## 5 -1.00047123 0.2079651754
## 6 -0.92728331 0.3064392898
## 7 -3.26549087 0.0001240587
## 8 -0.32444384 -0.1126493731
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (57) clusters.
## svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
## nest = T, data = survive_data)
## term estimate std.error statistic p.value
## 1 scale(LBXCBC) -0.15525258 0.057703587 -2.6905188 7.134103e-03
## 2 RIDAGEYR 0.07966211 0.004862256 16.3837753 0.000000e+00
## 3 sexmale 0.64734559 0.135824793 4.7660341 1.878875e-06
## 4 raceblack 0.36339696 0.199164759 1.8246047 6.806070e-02
## 5 racemexican -0.40284446 0.306778973 -1.3131423 1.891350e-01
## 6 racehispanic -0.31545492 0.316828210 -0.9956655 3.194127e-01
## 7 raceothers -1.63145503 0.833753194 -1.9567602 5.037567e-02
## 8 INDFMPIR -0.22610876 0.055369671 -4.0836211 4.433931e-05
## conf.low conf.high
## 1 -0.26834953 -0.042155630
## 2 0.07013226 0.089191955
## 3 0.38113389 0.913557296
## 4 -0.02695880 0.753752713
## 5 -1.00412020 0.198431278
## 6 -0.93642680 0.305516958
## 7 -3.26558126 0.002671203
## 8 -0.33463132 -0.117586196
BMI Top 3 findings: 1. LBXGTC g-tocopherol(ug/dL) : Hazard ratio change by a factor of exp(0.1148822) = 1.121, with 1 SD change of LBXBHC, p-value = 0.0082, which mean LBXGTC significantly increase hazard ratio 2. LBXB12 Vitamin B12(pg/mL): Hazard ratio change by a factor of exp(0.02023399) = 1.021, with 1 SD change of LBX099, p-value = 0.37, which is not significant 3. LBXFOL Folate(ng/mL): Hazard ratio change by a factor of exp(-0.0242372010) = 1.019, with 1 SD change of LBX074, p-value = 0.65, which is not significant
Glucose Top3 findings: 1. LBXGTC g-tocopherol(ug/dL) : Hazard ratio change by a factor of exp(0.1148822) = 1.121, with 1 SD change of LBXBHC, p-value = 0.0082, which mean LBXGTC significantly increase hazard ratio 2. LBXBEC trans-Beta carotene (ug/dL): Hazard ratio change by a factor of exp(-0.24102727) = 0.786, with 1 SD change of LBXPFDO, p-value = \(5.4*10^{-5}\), meaning LBXBEC significantly reduce the hazard ratio. 3. LBXCBC cis-Beta carotene (ug/dL):Hazard ratio change by exp(-0.15525258) = 0.856, with 1 SD change of LBX074, p-value = 0.0071, meaning LBXCBC significantly reduce the hazard ratio